We learned the big 5 data transformation functions in the dplyr package:
| function | purpose |
|---|---|
filter() |
Pick observations by their values |
arrange() |
Reorder the rows |
select() |
Pick variables by their names |
mutate() |
Create new variables |
summarise() |
Create summaries |
And we used these functions to create this plot:
By the end of this session, you should be able to:
tidyr package, a member of the tidyverseCtrl/Cmd+Shift+F10), clear the console (Ctrl/Cmd+L), and clear your workspaceIs your project still open? If not, click on the project icon to load it. (Don’t create a new one.)
Run the following code in your console. Change products to your preferred subfolder.
download.file("https://tinyurl.com/y9ze2grp",
destfile = "products/lab-w05.Rmd")Tools > Global Options
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table (Wickham 2014)
Wickham also uses tidy as a verb meaning “to structure a dataset to facilitate analysis”.
How many variables are in this dataset? This table and the following from Wickham (2014).
Columns are variables and rows are observations (i.e., combination of religion and income):
How would you link up with population data by country and age group to calculate rates? (answer: after tidying)
The element column is not a variable with values, but rather a vector of variable names.
Every row becomes an observation (i.e., weather station by date) with two measurements:
This style of data entry and storage invites errors and inconsistencies:
| function | purpose |
|---|---|
gather() |
Gather variable values spread across multiple columns |
spread() |
Spread out observation values scattered across rows |
separate() |
Split one column into two or more columns |
unite() |
Collapse multiple columns into one column |
Load the tidyverse package and look at the included dataset called table4a.
library(tidyverse)
table4a## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
What is messy about this tibble?
gather()1999 and 2000 are values, not variables. The variable should be year.
Examples and figures from Wickham and Grolemund (2017)
gather() table4a %>% # remember pipes?
gather(`1999`, `2000`, key = "year", value = "cases")## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
We gather values by the column names key, which we name year. We then store the values in a new variable called cases.
gather()Now gather table4b (also included with tidyverse to create variables for year and population values (call it “population”).
table4b## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
spread()Now look at table2. type is not a variable!
table2## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
spread()Let’s use spread() to make two proper variables: cases and population.
spread()The key in this example is the type column, and the values we want to spread are stored in count.
spread()Just like with gather(), we’ll pass two arguments to spread() in addition to an object: key and value.
spread(table2, key = type, value = count)## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
spread out wide, or gather (stack) into a long (tall) pileseparate()We want split rate into two columns: cases and population
separate()Simple, just tell R the column to split and the columns to create:
table3 %>%
separate(rate, into = c("cases", "population"))## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
separate()separate() is smart enough to know to split on “/”, but you can also specify with sep = "/". The convert==TRUE argument will guess at the data type rather than leave the new columns characters.
table3 %>%
separate(rate, into = c("cases", "population"),
sep="/", convert = TRUE)## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
separate()Can also separate on a specific character position with the sep argument. For example, we separate year into century and year by specifying sep==2. Count from left (1, 2, …) or from right (-1, -2, …).
table3 %>%
separate(rate, into = c("century", "year"),
sep=2, convert = TRUE)## # A tibble: 6 x 4
## country year century year
## * <chr> <int> <int> <chr>
## 1 Afghanistan 1999 74 5/19987071
## 2 Afghanistan 2000 26 66/20595360
## 3 Brazil 1999 37 737/172006362
## 4 Brazil 2000 80 488/174504898
## 5 China 1999 21 2258/1272915272
## 6 China 2000 21 3766/1280428583
unite()unite() does the opposite: it combines two or more columns into one.
unite()In this example, new is the name of the new column we want to create by combining columns century and year.
table5 %>%
unite(new, century, year)## # A tibble: 6 x 3
## country new rate
## * <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
unite()By default unite() will add an underscore, but adding sep="" (no space) will combine without the _.
table5 %>%
unite(new, century, year, sep = "")## # A tibble: 6 x 3
## country new rate
## * <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
One last point to consider is how to handle missing values when transforming data. Let’s make a tibble called stocks with 2 years of quarterly data on returns.
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)There are two types of missing data:
complete() returns all combinations of a set of columns and will fill in NA when combinations are missing.
stocks %>%
complete(year, qtr)## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. doi:10.18637/jss.v059.i10.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.